Binary Classification using Logistic Regression with numpy

by Pritish Jadhav, Mrunal Jadhav - Sun, 08 Jul 2018
Tags: #python #logistic regression #machine learning #classification #supervised learning

Problem Definition -

  • Suppose you are an auto enthusiast. You have been capturing pictures of cars for a long time and putting them in an album labelled 'Cars'. So far, the sorting of the photos has been done manually.
  • To manage your library easily, you would like to automate this process: each image captured by the camera is scanned and, if it is an image of a car, it is automatically organised under the label 'Cars'.

Classification Algorithms -

  • Classification is a branch of supervised machine learning algorithms where target values are discrete.
  • In a binary classification problem, the output is 1 or 0 (Car or not Car). This is represented by a Bernoulli random variable: if an experiment succeeds with probability P, the outcome takes one of 2 values, 1 for success and 0 otherwise. Thus the output is bounded on both ends.

So how do we solve classification problems?

Logistic Regression

Logistic Regression is a classification algorithm for linearly separable data points with a dichotomous (binary) outcome. It works by learning the function P(Y|X),
where Y is the indicator variable and X are the inputs.

Given a data point (x, y), logistic regression models P(Y=1|X=x).

We can say,
P(Y=1|X=x) = E(Y|X=x) (the conditional expectation of the indicator variable)

Mathematically this is given by $P(Y=1\mid X=x) = \sigma(z)$, where z is given by $z = \theta_0 + \sum_{i=1}^{m}\theta_i x_i$.

1. The Sigmoid Function

Logistic Regression starts by calculating the odds ratio.

Odds Ratio - The Odds Ratio (OR) is defined as the ratio of the probability of success to the probability of failure. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.

The odds ratio is calculated from the probability and, unlike the probability, it can range from 0 to infinity: $\text{Odds Ratio} = \frac{P}{1-P}$

Naturally, the more likely it is for the positive event P(Y=1|X=x) to occur, the larger is the odds ratio.

Now, if we take the log of the odds ratio, we get the logit function.

The logit function is monotonic: the higher the log of the odds ratio, the higher the odds ratio.

  • It is usually difficult to model a variable that has a restricted range, such as a probability. This transformation is an attempt to get around the restricted-range problem: it maps a probability ranging between 0 and 1 to log odds ranging from negative infinity to positive infinity. $\text{logit}(P) = \log\left(\frac{P}{1-P}\right) = \log(P) - \log(1-P)$

Therefore, $\text{logit}(P) = \theta_0 + \sum_{i=1}^{m}\theta_i x_i$
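As a quick illustrative sketch (the probabilities below are made up for demonstration and are not part of the original post), the mapping from probability to odds to log odds can be checked with a few lines of numpy:

import numpy as np

# a few example probabilities between 0 and 1 (illustrative values only)
p = np.array([0.1, 0.5, 0.8, 0.99])

odds = p / (1 - p)       # odds: range from 0 to infinity
log_odds = np.log(odds)  # logit: ranges from -infinity to +infinity

print(odds)              # approx. [0.11, 1.0, 4.0, 99.0]
print(log_odds)          # approx. [-2.20, 0.0, 1.39, 4.60]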

[Figure: plot of the logit function]

Now, taking the inverse logit on both sides, we get the sigmoid function:
$\text{logit}^{-1}(\text{logit}(P)) = \text{logit}^{-1}\left(\theta_0 + \sum_{i=1}^{m}\theta_i x_i\right)$
$P = \frac{1}{1+e^{-(\theta_0 + \sum_{i=1}^{m}\theta_i x_i)}}$
$P = \frac{1}{1+e^{-z}}$

[Figure: plot of the sigmoid function]


$P = \text{sigmoid}(z) = \sigma(z)$

In Logistic Regression, we use Sigmoid Function to map the real value to a value between 0 and 1. In other words, it maps prediction to probability.

To summarize the previous section:

$P(Y=1\mid X=x) = \text{sigmoid}(z) = \sigma(z)$

where,
$z = \theta_0 + \sum_{i=1}^{m}\theta_i x_i = \theta_0 x_0 + \sum_{i=1}^{m}\theta_i x_i \quad (x_0 = 1)$

Therefore,
$z = \sum_{i=0}^{m}\theta_i x_i$
Vectorizing the above equation, we can write
$P(Y=1\mid X=x) = \sigma(\theta^T x)$
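As a minimal sketch (with made-up numbers, not part of the original derivation), the vectorized form $\sigma(\theta^T x)$ translates directly to numpy:

import numpy as np

def sigmoid(z):
    # maps any real value to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# illustrative values: 3 features plus the bias term x0 = 1
theta = np.array([0.5, -1.2, 0.8, 2.0])  # [theta_0, theta_1, theta_2, theta_3]
x = np.array([1.0, 0.3, 1.5, -0.7])      # [x_0 = 1, x_1, x_2, x_3]

z = np.dot(theta, x)                     # theta^T x
print(sigmoid(z))                        # P(Y=1 | X=x), a value between 0 and 1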

2. Deriving the Cost Function

The probabilities in logistic regression are parametrized by θ. Our goal is to estimate the values of these parameters. We estimate them using the Maximum Likelihood Estimator (MLE).
We perform this in 2 steps:

  1. Write the likelihood Function

    The likelihood function measures the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters.

  2. Find the values of the parameters that maximise the likelihood function

Step 1

The labels that we are predicting are binary, and the output of our logistic regression function is supposed to be the probability that the label is one. This means that we can (and should) interpret each label as a Bernoulli random variable: Y ∼ Ber(P) where P = σ(θ^Tx).

The probability can be written as,
$P(Y=1\mid X=x) = \sigma(\theta^T x)$

By the laws of probability, $P(Y=0\mid X=x) = 1 - \sigma(\theta^T x)$.
The compact way of writing these equations for a data point (x, y) is:

$P(Y=y\mid X=x) = \sigma(\theta^T x)^{y} \cdot \big(1 - \sigma(\theta^T x)\big)^{(1-y)}$


This is also the probability mass function of a Bernoulli. Now that we know the probability mass function we can write the likelihood of the data.

$L(\theta) = \prod_{i=1}^{m} P(Y=y^{(i)} \mid X=x^{(i)})$
Substituting the likelihood of the Bernoulli,
$L(\theta) = \prod_{i=1}^{m} \sigma(\theta^T x^{(i)})^{y^{(i)}} \cdot \big(1 - \sigma(\theta^T x^{(i)})\big)^{(1-y^{(i)})}$
Taking the log of the function, we get the log likelihood,

$LL(\theta) = \sum_{i=1}^{m} y^{(i)}\log \sigma(\theta^T x^{(i)}) + (1-y^{(i)})\log\big(1 - \sigma(\theta^T x^{(i)})\big)$
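As an illustrative sketch (toy numbers that are not part of the derivation), the log likelihood for a small batch can be computed directly from this formula:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: 4 examples, a bias column of ones plus 2 features
X = np.array([[1.0,  0.5,  1.2],
              [1.0, -0.3,  0.8],
              [1.0,  1.5, -0.4],
              [1.0,  0.2,  0.1]])
y = np.array([1, 0, 1, 0])
theta = np.array([0.1, 0.7, -0.2])  # [theta_0, theta_1, theta_2]

p = sigmoid(X.dot(theta))           # sigma(theta^T x) for every example
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood)               # a single negative number; closer to 0 is better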

Step 2

Now that we have the likelihood function, we need to choose the values of theta that maximise it. We can find the best values of theta by using an optimization algorithm such as Gradient Descent. In the optimization algorithm, we calculate the partial derivative of the log likelihood with respect to each parameter.

  • Gradient Descent works by minimising a function. Minimising the negative log likelihood is equivalent to maximising the log likelihood.
  • The 1/m factor "averages" the loss over the number of training examples so that the number of examples doesn't affect the scale of the cost.

Therefore, the cost function is as follows:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} y^{(i)}\log \sigma(\theta^T x^{(i)}) + (1-y^{(i)})\log\big(1 - \sigma(\theta^T x^{(i)})\big)$

Here are the partial derivatives, which we will derive later:

$A = \sigma(\theta^T x)$
$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m}(A - Y)^T X$
$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big(A^{(i)} - Y^{(i)}\big)$
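As a quick sanity check (a sketch that is not in the original post), the analytic gradient above can be compared against a finite-difference approximation on toy data:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # negative average log likelihood
    p = sigmoid(X.dot(theta))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(theta, X, y):
    # (1/m) * X^T (A - Y), matching the formula above
    m = X.shape[0]
    return X.T.dot(sigmoid(X.dot(theta)) - y) / m

# toy data (illustrative only)
rng = np.random.RandomState(0)
X = np.hstack([np.ones((5, 1)), rng.randn(5, 2)])  # bias column + 2 features
y = np.array([1, 0, 1, 1, 0])
theta = rng.randn(3)

# central finite-difference gradient
eps = 1e-6
numeric = np.array([(cost(theta + eps * np.eye(3)[j], X, y) -
                     cost(theta - eps * np.eye(3)[j], X, y)) / (2 * eps)
                    for j in range(3)])
print(np.allclose(numeric, analytic_grad(theta, X, y), atol=1e-6))  # expect True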

Let's start by importing the Python libraries that will help us accomplish the task.

In [1]:
## python libraries
import pandas as pd
import numpy as np

## import sklearn for loading the breast cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

from IPython.core.display import display,HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
display(HTML("<style>.container { width:100% !important; }</style>"))

Notations that we will be using throughout are as follows -

X - Input data

Y - Target labels

W - weights vector (to be estimated using training data)

b - intercept for the decision boundary

Notations for tracking dimensions are as follows -

m - number of training examples

n - number of input features

k - size of the labels vector (in case of binary classification, k = 1)

Load Cancer dataset provided by sklearn

In [2]:
def load_breast_cancer_dataset():
    '''
    This function loads the breast cancer dataset and returns the
    feature matrix X and the label vector Y.
    '''
    data_dict = load_breast_cancer()
    X, Y = data_dict['data'], data_dict['target']
    Y = Y.reshape(len(Y), 1)  # reshape the label vector so that it has a dimension of m x k
    return X, Y
In [3]:
X, Y = load_breast_cancer_dataset()

print "successfully loaded training data with %d training data and %d features"%(X.shape[0], X.shape[1])
successfully loaded training data with 569 training data and 30 features

Leverage the built-in normalization techniques provided by sklearn.

In [4]:
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

print X.shape, Y.shape
(569, 30) (569, 1)

While implementing Logistic Regression from scratch, it is important to keep track of dimensions.

For this tutorial, we shall use the following dimensions -

X = m × n

Y = m × 1

W = 1 × n

b = 1 × 1

The pseudo-code for training a Logistic Regression model is as follows -

Step 1 - Given a Training DataSet X and corresponding labels Y, initialize values of number of training examples(m), number of features(n) and number of output units (k) as mentioned in equations 1, 2, 3 and 4

Step 2 - Initialize Weights matrix W and bias b using dimensions from Step 1.

for i_epoch in range(iterations):

Step 3 - Forward Propagation -
Step 3.1 - Compute preactivations using equation 5:

pre_activations: $Z = XW^T + b$

Let's validate that the dimensions match: $Z = (m \times n) \times (n \times 1) + (1 \times 1) = (m \times 1)$, where the bias b is broadcast across the m rows (a quick numpy shape check is sketched below).

We are good to move ahead.
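As a throwaway numpy sketch (shapes only, using the dimensions of this dataset), the dimension check can be reproduced as follows:

import numpy as np

m, n, k = 569, 30, 1        # dimensions used in this tutorial
X = np.zeros((m, n))        # placeholder input data
W = np.zeros((k, n))        # weights, 1 x n
b = np.zeros((1, k))        # bias, 1 x 1

Z = np.dot(X, W.T) + b      # (m x n) . (n x 1) + (1 x 1)
print(Z.shape)              # (569, 1), i.e. m x 1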

Step 3.2 - Compute Activation using equation 6

activations: $A = \sigma(Z) = \frac{1}{1+e^{-Z}}$

Note that the dimensions of the activations are the same as the dimensions of the preactivations.



Step 3.3 - Compute the loss / error / cost using equation 7: $\text{cost} = -\frac{1}{m}\sum_{i=1}^{m} y_i\log(A_i) + (1-y_i)\log(1-A_i)$

It is important to note that the cost/error/loss is a scalar, which we would like to minimize using gradient descent.

Step 4 - Back Propagation

We shall derive the gradients using the chain rule -

$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial A} \cdot \frac{\partial A}{\partial Z} \cdot \frac{\partial Z}{\partial W}$


Let's start by computing the partial derivative $\frac{\partial L}{\partial A}$:
$\frac{\partial L}{\partial A} = \frac{\partial}{\partial A}\left[-\sum_{i=1}^{m} y_i\log(A_i) + (1-y_i)\log(1-A_i)\right]$

For one training example i -

$\frac{\partial L}{\partial A} = -\left[y^{(i)}\frac{\partial \log(A)}{\partial A} + (1-y^{(i)})\frac{\partial \log(1-A)}{\partial A}\right] = -\left[\frac{y^{(i)}}{A} - \frac{1-y^{(i)}}{1-A}\right]$


Now let's compute the partial derivative $\frac{\partial A}{\partial Z}$:
$\frac{\partial A}{\partial Z} = \frac{\partial}{\partial Z}\left[\frac{1}{1+e^{-Z}}\right] = \frac{\partial}{\partial Z}\left[(1+e^{-Z})^{-1}\right] = -(1+e^{-Z})^{-2}\,\frac{\partial}{\partial Z}(1+e^{-Z}) = \frac{e^{-Z}}{(1+e^{-Z})^{2}}$


Adding and subtracting 1 in the numerator of equation 10: $\frac{\partial A}{\partial Z} = \frac{1+e^{-Z}-1}{(1+e^{-Z})^{2}} = \frac{1+e^{-Z}}{(1+e^{-Z})^{2}} - \frac{1}{(1+e^{-Z})^{2}} = \frac{1}{1+e^{-Z}} - \frac{1}{(1+e^{-Z})^{2}}$

From equations 6 and 11, we get -

$\frac{\partial A}{\partial Z} = A - A^{2} = A(1-A)$
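A tiny numerical check (illustrative only, not part of the original derivation) that the derivative of the sigmoid really is A(1 - A):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7                                                      # arbitrary point
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
a = sigmoid(z)
print(numeric, a * (1 - a))                                  # the two values closely agree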

Multiplying equations 9 and 12, we get

$\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial A} \cdot \frac{\partial A}{\partial Z} = -\left[\frac{y^{(i)}}{A} - \frac{1-y^{(i)}}{1-A}\right] A(1-A) = -\left[y^{(i)}(1-A) - A(1-y^{(i)})\right] = -\left[y^{(i)} - y^{(i)}A - A + y^{(i)}A\right] = A - y^{(i)}$

Now, the last part of this derivation is to compute the gradients with respect to the weights/coefficients.

$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial W} = \frac{\partial L}{\partial Z}\left[\frac{\partial Z}{\partial W_1}, \frac{\partial Z}{\partial W_2}, \ldots, \frac{\partial Z}{\partial W_n}\right]$

The derivatives $\left[\frac{\partial Z}{\partial W_1}, \frac{\partial Z}{\partial W_2}, \ldots, \frac{\partial Z}{\partial W_n}\right]$ can be easily calculated as follows -

$\frac{\partial Z}{\partial W_1} = \frac{\partial}{\partial W_1}\left(x_1 w_1 + x_2 w_2 + \ldots + x_n w_n + b\right) = x_1$

Similarly, $\frac{\partial Z}{\partial W_2} = x_2,\ \frac{\partial Z}{\partial W_3} = x_3,\ \ldots,\ \frac{\partial Z}{\partial W_n} = x_n$

Now, the gradient of the bias term b is simply $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial b} = (A - Y^{(i)})\,\frac{\partial}{\partial b}\left(x_1 w_1 + x_2 w_2 + \ldots + x_n w_n + b\right) = (A - Y^{(i)})(1)$

Therefore, for one training example, the gradient wrt weights can be concisely computed using -

$\frac{\partial L}{\partial W} = (A - Y^{(i)})\left[x^{(i)}_1, x^{(i)}_2, x^{(i)}_3, \ldots, x^{(i)}_n\right], \qquad \frac{\partial L}{\partial b} = (A - Y^{(i)})$

By extending the computations for one training example to m training examples, we get -

$\frac{\partial L}{\partial W} = \sum_{i=1}^{m}\frac{\partial L^{(i)}}{\partial W}, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^{m}\frac{\partial L^{(i)}}{\partial b}$

Equations 13 and 14 can be summarised as follows -

$\frac{\partial L}{\partial W} = \frac{1}{m}(A - Y)^T X, \qquad \frac{\partial L}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\big(A^{(i)} - Y^{(i)}\big)$

Step 4.1 - The weights can be updated using the gradient descent equations as follows (α is the learning rate): $W = W - \alpha\frac{\partial L}{\partial W}, \qquad b = b - \alpha\frac{\partial L}{\partial b}$

In [5]:
def initialize_dimensions(X, Y):
    '''
    This function initializes the values of m, n and k
    m -> number of training examples
    n -> number of features
    k -> number of output labels (for binary classification, k= 1)
    
    '''
    m, n = X.shape
    k = Y.shape[1]
    return m, n, k
In [6]:
def initialize_weights_with_zeros(dim):
    '''
    This function initializes the weights vector and bias vector
    
    The general formulas for the weights matrix and bias vector are - 
    W --> number of output units x number of input features
    b --> 1 x number of output units
    
    '''
    W = np.zeros(shape = (dim))
    b = np.zeros(shape = (1, dim[0]))
    return W, b

The formula for sigmoid is given by -

$\sigma(z) = \frac{1}{1+e^{-z}}$
In [7]:
def sigmoid(z):
    '''
    This function computes the sigmoid of vector (numpy array).
    '''
    return 1.0/(1 + np.exp(-1.0 *z))
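A quick usage check (input values chosen arbitrarily), using the sigmoid defined above:

import numpy as np

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))   # approximately [0.0067, 0.5, 0.9933]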

The cost for logistic regression is given by -

$\text{cost}\ (L) = -\frac{1}{m}\sum_{i=1}^{m} y_i\log(A_i) + (1-y_i)\log(1-A_i)$

where Y is a vector of size m × 1, and A = activations = σ(z) is a vector of size m × 1.

In [8]:
def compute_sigmoid_cost(Y, A, m):
    # average negative log likelihood (cross-entropy) over the m training examples
    cost = -1.0/m * np.sum((Y * np.log(A) + ((1 - Y) * np.log(1 - A))), axis = 0)
    return cost
In [9]:
def compute_sigmoid_gradients(x, activations, y, m):
    # gradients of the cost with respect to the weights (dw) and the bias (db)
    dw = 1.0/m * np.dot((activations - y).T, x)
    db = 1.0/m * np.sum((activations - y), axis = 0, keepdims = True)
    return dw, db
In [10]:
def propagation(w, b, X, Y, m):
    # FORWARD PROPAGATION (FROM X TO COST)

    z = (np.dot(X, w.T) + b)
    activations = sigmoid(z)
    cost = compute_sigmoid_cost(Y, activations, X.shape[0])
    # BACKWARD PROPAGATION (TO FIND GRAD)
    dw, db = compute_sigmoid_gradients(X, activations, Y, m)

    cost = np.squeeze(cost)
    
    grads = {"dw": dw,
             "db": db}
    
    return grads, cost
In [17]:
def train_logistic_regression(X, Y, learning_rate = 0.2, n_epochs = 3000, print_cost = True):
    m,n,k  = initialize_dimensions(X, Y)
    w, b = initialize_weights_with_zeros((k, n))
    
    costs = []
    for iteration in range(n_epochs):
        grad, cost = propagation(w, b, X, Y, m)
        
        dw = grad['dw']
        db = grad['db']
        
        w = w - learning_rate * dw
        b = b - learning_rate * db
        
        # Record the costs
        if iteration % 10 == 0:
            costs.append(cost)

        # Print the cost and training accuracy every 500 iterations
        if print_cost and iteration % 500 == 0:
            activations = sigmoid(np.dot(X, w.T) + b)
            y_pred = np.where(activations > 0.5, 1, 0)
            accuracy = (float(np.sum(y_pred[:,0] == Y[:,0]))/ m)* 100
            display("Cost after iteration %i: %f | accuracy after iteration %i: %f" % (iteration, cost, iteration, accuracy))

    # final predictions on the training data
    activations = sigmoid(np.dot(X, w.T) + b)
    y_pred = np.where(activations > 0.5, 1, 0)

    params = {"w": w,
              "b": b}

    grads = {"dw": dw,
             "db": db}

    return params, grads, costs, y_pred
In [18]:
params, grads, costs, Y_pred = train_logistic_regression(X, Y, print_cost = True)
'Cost after iteration 0: 0.693147 | accuracy after iteration 0: 65.026362'
'Cost after iteration 500: 0.219201 | accuracy after iteration 500: 94.376098'
'Cost after iteration 1000: 0.169860 | accuracy after iteration 1000: 95.254833'
'Cost after iteration 1500: 0.147171 | accuracy after iteration 1500: 96.836555'
'Cost after iteration 2000: 0.133237 | accuracy after iteration 2000: 96.836555'
'Cost after iteration 2500: 0.123533 | accuracy after iteration 2500: 96.836555'

Awesome !!!

We were able to achieve 96.83% accuracy using logistic regression.

Confusion Matrix

The Confusion Matrix is a performance measurement for classification algorithms. It provides insight not only into the errors made by the model but also into the types of errors made.
Some of the basic terminologies related to Confusion Matrix are as follows :

* True Positive — Label that was predicted Positive and is actually Positive.
* True Negative — Label that was predicted Negative and is actually Negative.
* False Positive (Type 1 Error) — Label that was predicted Positive but is actually Negative.
* False Negative (Type 2 Error) — Label that was predicted Negative but is actually Positive.

Other Metrics that can be computed using Confusion Matrix are :

  1. Sensitivity - Also called Recall, Hit Rate, or True Positive Rate (TPR). It measures the proportion of actual positives that are correctly identified. $TPR = \frac{TP}{P} = \frac{TP}{TP + FN}$


  2. Specificity - Also called Selectivity or True Negative Rate (TNR). It measures the proportion of actual negatives that are correctly identified. $TNR = \frac{TN}{N} = \frac{TN}{TN + FP}$


  3. Precision - Also called the Positive Predictive Value. The precision metric shows the accuracy of the positive class: it measures how likely a prediction of the positive class is to be correct. $\text{Precision} = \frac{TP}{TP + FP}$


  4. Accuracy Score - Accuracy is the number of correct predictions divided by the total size of the dataset. $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$


  5. F1 Score - The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. $F1\ \text{Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$


In [80]:
"""
Here we treat Y = 1 as the positive class and Y = 0 as the negative class.

"""
def plot_confusion_matrix(actual, predicted):
    # rows = actual class (1, 0); columns = predicted class (1, 0)
    confusion_matrix = np.zeros((2, 2))
    confusion_matrix[0][0] = int(((predicted == 1) & (actual == 1)).sum())  # true positives
    confusion_matrix[0][1] = int(((predicted == 0) & (actual == 1)).sum())  # false negatives
    confusion_matrix[1][0] = int(((predicted == 1) & (actual == 0)).sum())  # false positives
    confusion_matrix[1][1] = int(((predicted == 0) & (actual == 0)).sum())  # true negatives

    return confusion_matrix
In [83]:
confusion_matrix=plot_confusion_matrix(Y,Y_pred)
TP = confusion_matrix[0][0]
FN = confusion_matrix[0][1]
FP = confusion_matrix[1][0]
TN = confusion_matrix[1][1]
recall=TP/(TP+FN)
specificity = TN/(TN+FP)
precision=TP/(TP+FP)
accuracy= (TP+TN)/(TP+TN+FP+FN)
f1_score=(2*precision*recall)/(precision+recall)


print("The Confusion Matrix of the output is :")
print(confusion_matrix)
print("The Sensitivity is : ", recall)
print("The Specificity is : ", specificity)
print("The Precision is : ", precision)
print("The Accuracy is : ", accuracy)
print("The F1 Score is : ", f1_score)
The Confusion Matrix of the output is :
[[ 353.    4.]
 [  14.  198.]]
('The Sensitivity is : ', 0.98879551820728295)
('The Specificity is : ', 0.93396226415094341)
('The Precision is : ', 0.96185286103542234)
('The Accuracy is : ', 0.96836555360281196)
('The F1 Score is : ', 0.97513812154696133)
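As an optional cross-check (not part of the original notebook), the same numbers can be verified against sklearn's built-in metrics, assuming Y and Y_pred from the cells above:

from sklearn import metrics

y_true = Y.ravel()
y_hat = Y_pred.ravel()

# note: sklearn orders its confusion matrix as [[TN, FP], [FN, TP]] for labels (0, 1)
print(metrics.confusion_matrix(y_true, y_hat))
print(metrics.recall_score(y_true, y_hat))     # sensitivity
print(metrics.precision_score(y_true, y_hat))
print(metrics.accuracy_score(y_true, y_hat))
print(metrics.f1_score(y_true, y_hat))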
Congratulations on Completing Logistic Regression !!!


Can we use Linear Regression to solve the classification problem?

A linear regression studies the relationship between
  • a response variable Y and
  • a single explanatory variable X.
However, Linear Regression doesn't seem to be a good fit for classification problems, owing to the following reasons.



  • Linear Regression assumes normality of the residual errors (which represent the variation in Y), whereas the outcome of a classification problem is not normally distributed.

  • The variance (and the standard deviation) in linear regression does not depend on X (the input variable).
    For classification problems, the mean and the variance depend on the probability. Thus any input that affects the probability of the output also affects the mean and the variance. This violates the assumptions of linear regression.

  • Linear Regression deals with continuous variables instead of discrete variables. This may result in an output with a value less than 0 or greater than 1.
    A true probability has a value between 0 and 1.
    Thus Linear Regression fails to model a true probability.

    • We could try to solve the problem of unbounded output by letting log p(x) be a linear function of x, so that changing an input variable multiplies the probability by a fixed amount.
      The problem here is that logarithms are unbounded in only one direction, while linear functions are not.
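As an illustrative sketch (not from the original post), fitting ordinary least squares to the binary labels of this same dataset shows the predictions escaping the [0, 1] range:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

data = load_breast_cancer()
X_lin = MinMaxScaler().fit_transform(data['data'])
y_lin = data['target']

lin_reg = LinearRegression().fit(X_lin, y_lin)
preds = lin_reg.predict(X_lin)

# typically some fitted "probabilities" fall below 0 or above 1,
# which a true probability never can
print(preds.min(), preds.max())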